This Jupyter Notebook hosts all the calculations and data rendering behind the results discussed in the PDF report.
The data used in this study was taken from the World Bank and is stored in the file Data_Extract_From_World_Development_Indicators.xlsx. The data covers all countries for the indicators mentioned in the PDF report. The goal is to use the data from all countries to run the regression algorithm and then interpret the results and share insights for the 4 countries chosen in this study.
Before we proceed with any code, we first import all the libraries that we will be using in the future.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats import shapiro
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
import statsmodels.api as sm
%matplotlib inline
Before we proceed with our analysis of the data, we need to do some data wrangling and transformations to represent the data in the desired form. We will discuss these one by one below.
The data extracted from the World Bank has the Years in the columns and the indicator values in the rows. We will transform this so the indicators are in the columns and the Years (time series) are in the rows. To do this we use the python library pandas, a data analysis and manipulation tool that represents data in what are called Data Frames. The python code used to perform this transformation is given below
data = pd.read_excel("Data_Extract_From_World_Development_Indicators.xlsx", sheet_name="Data")
income_level_data = pd.read_excel("Income_Categories.xlsx", sheet_name="Metadata - Countries")
def clean_data(df):
    '''
    Take a DataFrame that represents a given country, transpose it, rename the indices,
    create a country column, delete the country and series rows.
    '''
    df = df.transpose()
    df.columns = df.iloc[1]
    country = df.iloc[0, 0]
    df = df.drop(df.index[0])
    df = df.drop(df.index[0])
    df["Country"] = country
    return df
# 264 countries in the dataset, 10 indicator rows per country.
# (DataFrame.append was removed in recent pandas versions; pd.concat is the
# idiomatic way to stack the per-country frames.)
frames = []
for i in range(264):
    frames.append(clean_data(data[i*10:(i+1)*10]))
df = pd.concat(frames)
df['IncomeGroup'] = ""
for c in df["Country"].unique():
    if income_level_data[income_level_data["Country"] == c]['IncomeGroup'].any():
        df.loc[df["Country"] == c, 'IncomeGroup'] = income_level_data[income_level_data["Country"] == c]['IncomeGroup'].values[0]
    else:
        df.loc[df["Country"] == c, 'IncomeGroup'] = "NA"
df
df.to_excel("Clean_Data.xlsx")
Net official development assistance (ODA) consists of loans and grants that are received by nations to support their economies. Certain countries (mostly high income countries) do not receive any Net ODA; instead, they provide assistance to other developing nations. The World Bank data for some countries reflects a negative value for Net ODA received, while for other countries it has no values for Net ODA received but does have values for Net ODA provided to other countries. We will combine this data into a single variable, Net ODA received, by subtracting the Net ODA provided. This way, we have a standardized representation of the data. The reference for this can be found here
This calculation is done in excel and the results are stored directly in the workbook.
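Although the netting was done in Excel, the same adjustment could be sketched in pandas. This is only an illustration: the column names `Net_ODA_received` and `Net_ODA_provided` and the sample values are hypothetical placeholders, not the actual workbook headers.

```python
import pandas as pd

# Hypothetical frame holding both ODA series; names and values are
# placeholders, not the actual World Bank column headers.
df = pd.DataFrame({
    "Net_ODA_received": [1.5, None, 0.8],
    "Net_ODA_provided": [None, 0.4, None],
})

# Treat missing values as zero, then net the two series into one column:
# countries that only provide ODA end up with a negative "received" value.
df["Net_ODA"] = df["Net_ODA_received"].fillna(0) - df["Net_ODA_provided"].fillna(0)
```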
For the python libraries to work with the data, there are some cleaning activities that need to be performed. 1) The Year column is cleaned to contain only numbers
2) Cells where data is not available are left empty instead of containing '..'
The output data is stored in the file Clean_Data.xlsx
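This cleanup was also done in Excel, but as an illustration the '..' placeholders could equally be handled in pandas. A minimal sketch with a toy frame (not the actual cleaning step used):

```python
import pandas as pd
import numpy as np

# World Bank exports mark missing values with '..'; replace them with NaN
# so pandas treats them as proper missing data, then convert to numeric.
df = pd.DataFrame({"Year": [2000, 2001], "GDP": ["..", "1234.5"]})
df = df.replace("..", np.nan)
df["GDP"] = pd.to_numeric(df["GDP"])
```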
Before we proceed with the calculation of the statistics for the given data, we see that we have some missing data points in our dataset and we have to deal with them. There are broadly 2 ways in which we can deal with missing data
Drop missing data
Data Imputation
Data Imputation is the process of filling in missing data, and there are various techniques to do so (one is to calculate the mean of the available values of a variable and fill the missing entries with that mean). This approach is fine if the variables under consideration have no relationship between them and if we do not intend to perform any correlational study between variables. In our case, since we do study the relationships between variables, dropping the missing data is the better approach.
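For illustration, mean imputation is a one-liner in pandas. This sketch is not used in our pipeline, since we drop missing rows instead:

```python
import pandas as pd
import numpy as np

# A toy series with one missing value.
s = pd.Series([2.0, np.nan, 4.0])

# Fill missing entries with the mean of the observed values (here, 3.0).
s_imputed = s.fillna(s.mean())
```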
There are 2 kinds of missing data that are present in our dataset.
Regression algorithms require data to be available for all the indices in a given sample. We have many years where we do not have the data for a few indices. We will drop such data points as shown below
data = pd.read_excel("Clean_Data.xlsx", sheet_name="Data")
data.info()
The column names are descriptions of the indicators. We shall rename the column names for easier access
data = data.rename(columns={'Mortality rate, under-5 (per 1,000 live births)': 'Mortality',
                            'Population growth (annual %)': 'Population_Growth',
                            'Fertility rate, total (births per woman)': 'Fertility',
                            'Age dependency ratio (% of working-age population)': 'Age_Dependency',
                            'GDP per capita (current US$)': 'GDP',
                            'Inflation, consumer prices (annual %)': 'Inflation',
                            'Net ODA received (% of GNI)': 'Net_ODA',
                            'Unemployment, total (% of total labor force) (modeled ILO estimate)': 'Unemployment',
                            'Adjusted savings: education expenditure (% of GNI)': 'Education_Expenditure'})
data_filtered = data.dropna()
data_filtered.info()
There are certain entries where data is available for some countries and not for others. This causes a bias in the estimates of statistical measures such as the mean, median and so on. For example, if the GDP for a high income country is available for 40 years and for a low income country it is only available for 20 years, calculating the mean GDP for the high income country over all 40 years will induce a bias.
We handle the missing data as per country for the 4 countries we have considered as shown below
years = np.arange(1960, 2020, 1)
data_filtered_Canada = data_filtered[data_filtered['Country']=='Canada']
data_filtered_Indonesia = data_filtered[data_filtered['Country']=='Indonesia']
data_filtered_India = data_filtered[data_filtered['Country']=='India']
data_filtered_Sudan = data_filtered[data_filtered['Country']=='Sudan']
available_years = []
for i in years:
    if i in data_filtered_Canada.Year.unique() and i in data_filtered_Indonesia.Year.unique() and i in data_filtered_India.Year.unique() and i in data_filtered_Sudan.Year.unique():
        available_years.append(i)
data_filtered = data_filtered[data_filtered.Year.isin(available_years)]
data_filtered.info()
Now that our data is clean, we will perform some exploratory data analysis to understand it better. First, we compute summary statistics for the whole dataset as shown below.
data_filtered.describe()
The above statistics are for all the countries combined. However, we are interested in describing the data for individual countries. We will also extract the data of the indices that are calculated for groups of countries such as High Income countries, Low income countries and so on. This data is provided by World Bank. We do that as follows
data_filtered[data_filtered['Country']=='Canada'].describe()
data_filtered[data_filtered['Country']=='Indonesia'].describe()
data_filtered[data_filtered['Country']=='India'].describe()
data_filtered[data_filtered['Country']=='Sudan'].describe()
data_filtered[data_filtered['Country']=='High income'].describe()
data_filtered[data_filtered['Country']=='Upper middle income'].describe()
data_filtered[data_filtered['Country']=='Lower middle income'].describe()
data_filtered[data_filtered['Country']=='Low income'].describe()
data_filtered[data_filtered['Country']=='World'].describe()
In order to better understand our data, we will create some plots as shown below. We make use of python libraries matplotlib and seaborn to do this.
First we create a box and whiskers plot that will help us better understand the distribution of the indices across countries
variables = ['Mortality','Population_Growth','Fertility','Age_Dependency','GDP','Inflation','Net_ODA','Unemployment','Education_Expenditure']
label_names = ['Mortality rate, under-5 (per 1,000 live births)',
'Population growth (annual %)',
'Fertility rate, total (births per woman)',
'Age dependency ratio (% of working-age population)',
'GDP per capita (current US$)',
'Inflation, consumer prices (annual %)',
'Net ODA received (% of GNI)',
'Unemployment, total (% of total labor force) (modeled ILO estimate)',
'Adjusted savings: education expenditure (% of GNI)']
countries = ['Canada','Indonesia','India','Sudan','High income','Upper middle income','Lower middle income','Low income','World']
df = data_filtered[data_filtered['Country'].isin(countries)]
for i in range(0, len(variables)):
    plt.figure(figsize=(15, 8))
    sns.boxplot(x="Country", y=variables[i], data=df)
    plt.xticks(rotation=45)
    plt.ylabel(label_names[i])
    plt.tight_layout()
    plt.savefig(str(variables[i]) + '_boxplot' + '.jpg')
Next, we plot histograms of the individual indices for all countries under consideration. We also mark the mean and median values in the histograms
for i in range(0, len(variables)):
    for j in range(0, len(countries)):
        df = data_filtered[data_filtered['Country']==countries[j]]
        variable_name = variables[i]
        xlabel_value = label_names[i]
        plt.figure(figsize=(12, 5))
        plt.hist(data=df, x=variable_name, bins=20, color='teal')
        plt.xlabel(xlabel_value, fontsize=12)
        min_ylim, max_ylim = plt.ylim()
        ax = plt.gca()
        plt.axvline(df[variable_name].mean(), color='blue', linestyle='dashed', linewidth=1)
        plt.text(0.80, 0.98, 'Mean: {:.2f}'.format(df[variable_name].mean()), horizontalalignment='right', verticalalignment='top', transform=ax.transAxes, color='blue')
        plt.axvline(df[variable_name].median(), color='red', linestyle='dashed', linewidth=1)
        plt.text(0.80, 0.9, 'Median: {:.2f}'.format(df[variable_name].median()), horizontalalignment='right', verticalalignment='top', transform=ax.transAxes, color='red')
        plt.title(df['Country'].unique()[0], fontsize=12, weight='bold')
        plt.savefig(str(variable_name) + '_' + str(df['Country'].unique()[0]) + '.jpg')
Next, we also plot a time series line plot in order to see the trend of indices with respect to time.
for i in range(0, len(variables)):
    plt.figure(figsize=(15, 10))
    sns.pointplot(x='Year', y=variables[i], data=data_filtered[data_filtered['Country'].isin(countries)], hue='Country')
    plt.ylabel(label_names[i])
    plt.xticks(rotation=90)
    plt.savefig(str(variables[i]) + '_pointplot.jpg')
Before we proceed with regression, we must process our data further to obtain better results.
We will first drop the rows for country groups such as the World, the Arab World, Sub-Saharan Africa and so on. These represent the average value of all countries in that region. Since we are considering individual country data, we do not need the corresponding averages calculated by the World Bank
country_list = list(data_filtered['Country'].unique())
# Individual countries come first in the World Bank extract; the aggregates
# (World, regions, income groups) are listed after 'Zimbabwe'.
country_list = country_list[0:country_list.index('Zimbabwe')+1]
data_filtered = data_filtered[data_filtered['Country'].isin(country_list)]
Next, we will log transform the Mortality rate and GDP variables for two reasons. First, the log compresses their scale, so log(GDP) and log(Mortality) are comparable in magnitude to the rest of our independent variables, which mostly represent percentages and are small in magnitude. Second, we want to capture how multiplicative (percentage) changes in GDP affect the dependent variable, which is achieved by taking the log of GDP.
data_filtered['Mortality_log'] = np.log(data_filtered['Mortality'])
data_filtered['GDP_log'] = np.log(data_filtered['GDP'])
Next, we will plot the data to have a look at the relationship of every variable with each other. This will give us a good idea of any preprocessing that we must do before we run our regression model. We plot the data using seaborn's pairplot function that is shown below
plt.figure(figsize=(15,15))
sns.pairplot(data_filtered[['Mortality_log', 'Population_Growth', 'Fertility',
'Age_Dependency', 'GDP_log', 'Inflation', 'Net_ODA', 'Unemployment',
'Education_Expenditure']]);
plt.tight_layout()
plt.savefig('Pairplot_before.jpg');
From the data, we can see that there are some issues that we need to take care of.
1) The Inflation index has some outliers in the data that might be bad for the regression algorithm
We can take care of this issue by excluding the outliers from our dataset for regression. After inspection, we can see there are some values of inflation that are above 200 and below -10 that might have been recorded due to false entries or some special situations that might have caused such values to occur. We will exclude these samples from our dataset.
data_filtered = data_filtered[data_filtered['Inflation']<200]
data_filtered = data_filtered[data_filtered['Inflation']>-10]
2) Population Growth seems to have outliers too.
We can take care of this issue as well by excluding the outliers from our dataset. We have some samples where the reported population growth is less than -2% and some samples have population growth above 6% which are not representative of the overall data. We will exclude these samples from our dataset as well.
data_filtered = data_filtered[data_filtered['Population_Growth']>-2]
data_filtered = data_filtered[data_filtered['Population_Growth']<6]
3) Net_ODA seems to have outliers too.
We can take care of this issue as well by excluding the outliers from our dataset. We have some samples where the reported Net ODA is above 50% which are not representative of the overall data. We will exclude these samples from our dataset as well.
data_filtered = data_filtered[data_filtered['Net_ODA']<50]
A very important point to be noted here is that the data that has been dropped above does not include any samples from the 4 countries of interest. We have dropped the data that do not represent the usual trend in the indices due to probable error in recording the information or natural situations such as famine, war or national emergencies that might have caused the outliers. We drop such information because our goal is to analyse the 4 countries we have chosen and to make comments on them. Data from other countries are used to make the linear regression model more robust and have a better overall fit.
4) Fertility and Age_Dependency seem to have a very high correlation amongst themselves. Correlation between independent variables is called multicollinearity and this causes issues in the regression algorithm too.
One way to analyze multicollinearity is using the Variance inflation factor (VIF). Values greater than 5 indicate the presence of high multicollinearity in our data. Let us calculate the VIF as shown below
# get y and X dataframes based on this regression:
y_vif, X_vif = dmatrices('Mortality_log ~ Population_Growth + Fertility + Age_Dependency + GDP_log + Inflation + Net_ODA + Unemployment + Education_Expenditure' , data_filtered, return_type='dataframe')
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif
Note: The "Intercept" value shown here is just the constant intercept value (alpha) that gets added in a linear regression model. We do not need to consider the VIF for the intercept.
This shows us that Fertility and Age_Dependency are highly correlated. One way to deal with this issue is to eliminate one of the variables. A better way is to combine the 2 indices into a single variable that represents both. We do this by creating a new variable called Fertility_Age_Dependency_log, the log of the product of the 2 variables, as shown below
data_filtered['Fertility_Age_Dependency_log'] = np.log(data_filtered['Fertility'] * data_filtered['Age_Dependency'])
Now let us run the VIF check once again and also plot our new data to visualize the scatterplots again
# get y and X dataframes based on this regression:
y_vif, X_vif = dmatrices('Mortality_log ~ Population_Growth + Fertility_Age_Dependency_log + GDP_log + Inflation + Net_ODA + Unemployment + Education_Expenditure' , data_filtered, return_type='dataframe')
# For each X, calculate VIF and save in dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
vif["features"] = X_vif.columns
vif
plt.figure(figsize=(15,15))
sns.pairplot(data_filtered[['Mortality_log', 'Population_Growth', 'Fertility_Age_Dependency_log', 'GDP_log', 'Inflation', 'Net_ODA', 'Unemployment',
'Education_Expenditure']]);
plt.tight_layout()
plt.savefig('Pairplot_after.jpg')
Now that we have performed the preprocessing on our data, we will declare dummy variables for the 4 income groups. We will include these variables in our regression model.
dummy = pd.get_dummies(data_filtered['IncomeGroup'])
data_filtered['High_Income_Dummy'] = dummy['High income']
data_filtered['Upper_Middle_Income_Dummy'] = dummy['Upper middle income']
data_filtered['Lower_Middle_Income_Dummy'] = dummy['Lower middle income']  # 'Low income' is the omitted base category
data_filtered.info()
data_filtered.describe()
We will save this in a separate excel file for future reference. The final output data is stored in the file Final_Data.xlsx
data_filtered.to_excel("Final_Data.xlsx")
In order to test the normality of the dependent variable, we use the Shapiro Wilk hypothesis test. The Null and Alternate hypotheses are given below
H0: The empirical distribution follows the normal distribution
HA: The empirical distribution is non normal
shapiro_wilk_test = shapiro(data_filtered['Mortality_log'])
shapiro_wilk_test
Based on the normality test of the dependent variable, we see that the Shapiro Wilk test rejects the Null hypothesis that Mortality(log) is normally distributed. One important point: the normality of the dependent variable is assessed over all countries (i.e. all samples included) because we will be using all samples in our regression algorithm.
The dependent variable is non-normal, however, the important measure is if our residuals obtained after regression analysis are normal. We will verify this once we run our regression algorithm.
Before we run the regression analysis, our goal is to first try to analyze the Mortality across our 4 countries under consideration. We can use a related samples t-test to compare the means. However, in order to do that, we need to first check if the Mortality is normal in all 4 countries. Else we will have to make use of other non-parametric procedures.
We do this with the help of the same Shapiro Wilk test that was shown above.
shapiro_wilk_results = {}
# Countries under consideration
countries = ['Canada','Indonesia','India','Sudan']
for country in countries:
    shapiro_wilk_test = shapiro(data_filtered[data_filtered['Country']==country].loc[:, "Mortality"])
    shapiro_wilk_results[country] = shapiro_wilk_test
shapiro_wilk_results
shapiro_wilk_results = {}
# Countries under consideration
countries = ['Canada','Indonesia','India','Sudan']
for country in countries:
    shapiro_wilk_test = shapiro(data_filtered[data_filtered['Country']==country].loc[:, "Mortality_log"])
    shapiro_wilk_results[country] = shapiro_wilk_test
shapiro_wilk_results
The results of the Shapiro Wilk test for normality show that both Mortality and the log transformed Mortality_log are normal for each of the 4 countries, since we are not able to reject the Null hypothesis in any case.
Now we can proceed with the related samples t-test.
Note: The version of the python library scipy used here does not provide a one tailed t-test directly (newer versions support it via the alternative argument). So we will perform a two tailed test with the following hypotheses
H0: The mean difference between the two samples is 0
HA: The mean difference between the two samples is not 0
If we reject the Null hypothesis, we conclude that the mean Mortality differs between the two countries.
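A one-tailed decision can still be recovered from the two-tailed output: halve the reported p-value and check the sign of the t-statistic. A minimal sketch with synthetic data (the arrays here are illustrative, not our country series):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50)
# b is systematically larger than a, so for the pair (a, b) we expect t < 0.
b = a + 1.0 + rng.normal(0.0, 0.1, 50)

t_stat, p_two_sided = stats.ttest_rel(a, b)

# One-tailed decision for H_A: mean(a) < mean(b):
# reject H0 when the statistic is negative and half the two-sided
# p-value falls below the significance level.
reject_less_than = bool(t_stat < 0 and p_two_sided / 2 < 0.05)
```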
from scipy import stats
countries = ['Canada','Indonesia','India','Sudan']
country_A = []
country_B = []
comparisons = []
for i in range(0, len(countries)):
    for j in range(i+1, len(countries)):
        comparisons.append(stats.ttest_rel(data_filtered[data_filtered['Country']==countries[i]]['Mortality'], data_filtered[data_filtered['Country']==countries[j]]['Mortality']))
        country_A.append(countries[i])
        country_B.append(countries[j])
df_comparison = pd.DataFrame(list(zip(country_A, country_B, comparisons)), columns=['Country A', 'Country B', 'Related samples T Test for mean Mortality'])
df_comparison
from scipy import stats
countries = ['Canada','Indonesia','India','Sudan']
country_A = []
country_B = []
comparisons = []
for i in range(0, len(countries)):
    for j in range(i+1, len(countries)):
        comparisons.append(stats.ttest_rel(data_filtered[data_filtered['Country']==countries[i]]['Mortality_log'], data_filtered[data_filtered['Country']==countries[j]]['Mortality_log']))
        country_A.append(countries[i])
        country_B.append(countries[j])
df_comparison = pd.DataFrame(list(zip(country_A, country_B, comparisons)), columns=['Country A', 'Country B', 'Related samples T Test for mean Mortality(log)'])
df_comparison
From the above calculations, it is clear that the mean Mortality values of all 4 countries differ from each other.
Also, given the p and t values from a two-tailed test, we reject the null hypothesis of a one-tailed (less-than) t-test when p/2 < alpha and t < 0. That is clearly the case for all comparisons above.
Therefore we can safely say, the mean mortality follows the order Canada < Indonesia < India < Sudan.
The mean value of Mortality and the corresponding 95% confidence intervals for the four countries are calculated as shown below.
countries = ['Canada','Indonesia','India','Sudan']
confidence_intervals = []
mean_Mortality = []
for i in range(0, len(countries)):
    data = data_filtered[data_filtered['Country']==countries[i]]['Mortality']
    confidence_intervals.append(stats.t.interval(alpha=0.95, df=len(data)-1, loc=np.mean(data), scale=stats.sem(data)))
    mean_Mortality.append(data.mean())
df_confidence = pd.DataFrame(list(zip(countries, mean_Mortality, confidence_intervals)), columns=['Country', 'Mean Mortality', '95% Confidence Interval'])
df_confidence
countries = ['Canada','Indonesia','India','Sudan']
confidence_intervals = []
mean_Mortality = []
for i in range(0, len(countries)):
    data = data_filtered[data_filtered['Country']==countries[i]]['Mortality_log']
    confidence_intervals.append(stats.t.interval(alpha=0.95, df=len(data)-1, loc=np.mean(data), scale=stats.sem(data)))
    mean_Mortality.append(data.mean())
df_confidence = pd.DataFrame(list(zip(countries, mean_Mortality, confidence_intervals)), columns=['Country', 'Mean Mortality(log)', '95% Confidence Interval'])
df_confidence
Now that our data is ready, we can run our regression model. We will first figure out the value of lambda for the L2 regularization parameter in our Ridge regression model. We find an approximate value of lambda by plotting the change in the residuals (or Sum of squared errors) versus the lambda value as shown below
X = data_filtered[['Population_Growth', 'Fertility_Age_Dependency_log', 'GDP_log',
'Inflation', 'Net_ODA', 'Unemployment','Education_Expenditure',
'High_Income_Dummy','Upper_Middle_Income_Dummy','Lower_Middle_Income_Dummy']]
X = sm.add_constant(X)
y = data_filtered[['Mortality_log']]
model = sm.OLS(y, X)
results_normal_ols = model.fit()
frames = []
for a in np.arange(0, 0.25, 0.05):
    results_ridge = model.fit_regularized(L1_wt=0, alpha=a, start_params=results_normal_ols.params)
    results_ridge_fit = sm.regression.linear_model.OLSResults(model, results_ridge.params, model.normalized_cov_params)
    frames.append(np.append(results_ridge.params, results_ridge_fit.ssr))
df = pd.DataFrame(frames, columns=list(X.columns.values) + ['ssr*'])
df.index = np.arange(0, 0.25, 0.05)
df.index.name = 'lambda*'
df.T
fig, ax = plt.subplots(1, 2, figsize=(14, 4));
ax[0] = df.iloc[:, :-1].plot(ax=ax[0])
ax[0].set_title('Coefficients')
ax[1] = df.iloc[:, -1].plot(ax=ax[1])
ax[1].set_title('SSR');
plt.savefig('lambda_selection.jpg');
We see that the value of SSR (Sum of squared errors) starts falling at alpha = 0.05. Thus we choose alpha to be 0.05 in our Ridge Regression model.
Our final model and its parameters are shown below
results_ridge = model.fit_regularized(L1_wt=0, alpha=0.05, start_params=results_normal_ols.params)
final = sm.regression.linear_model.OLSResults(model,
results_ridge.params,
model.normalized_cov_params)
final.summary()
The coef column gives the values of the regression coefficients for all the independent variables. The high coefficient for Fertility_Age_Dependency_log shows that the combination of Fertility and Age Dependency has a very strong impact on Mortality. GDP_log also has a strong impact on Mortality, but the relationship is negative. Education_Expenditure has a moderate negative impact on Mortality, Unemployment has a marginal positive impact, and Inflation and Net ODA have an almost negligible impact. All of these regression coefficients have a sufficiently large t-statistic and a low p-value, which means they are significant. The coefficient for Population Growth is relatively small, its t-statistic is low, and its p-value is greater than 0.05, so it is not statistically significant.
The way we interpret the coefficients is shown below with an example considering Unemployment.
The mean value of the log of the Mortality rate under-5 (per 1,000 live births) increases by 0.0184 for a unit increase in the Unemployment rate (% of total labor force). Equivalently, since the dependent variable is in logs, Mortality itself increases by roughly 1.9% per unit increase in Unemployment.
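Because the dependent variable is in logs, the coefficient can be read multiplicatively: a unit increase in Unemployment scales Mortality by e^0.0184. A quick check of that arithmetic:

```python
import numpy as np

coef = 0.0184  # regression coefficient on Unemployment from the summary above

# Percentage change in (un-logged) Mortality per unit increase in Unemployment.
pct_change = (np.exp(coef) - 1) * 100
```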
R-squared signifies the proportion of variability in the dependent variable that is explained by the independent variables. The value of 0.867 means our model explains 86.7% of the variability
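As a reminder of how R-squared is computed, here is a minimal numpy sketch on toy data (the values are illustrative only, not our model's output):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])        # observed values
y_pred = np.array([1.1, 1.9, 3.2, 3.8])   # fitted values

ss_res = np.sum((y - y_pred) ** 2)        # sum of squared residuals
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
r_squared = 1 - ss_res / ss_tot           # fraction of variance explained
```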
F-Statistic and Prob(F-Statistic) signify the overall significance of the regression, the Null hypothesis being that all the regression coefficients are equal to zero. Clearly, the Prob(F-Statistic) is close to 0 and the F-Statistic is large, indicating the regression is significant
Although a high value of R-squared is desirable, it is also important that the residuals are normally distributed. One of the assumptions of Linear regression is Homoskedasticity, meaning the variance of the residuals is the same for all values of the independent variables. We will analyze these aspects in the sections below
Now that we have the regression coefficients, we can use them to calculate the Estimated/Predicted Mortality for all of our entries.
We can then analyze the difference between what the regression model predicts Estimated/Predicted Mortality to be and what the actual observed value of Mortality is. We can plot the Residuals using a histogram as shown below.
Note: When analyzing the residuals, we focus on the 4 countries under consideration, alongside the overall residuals
data_filtered['Predicted_Mortality_log'] = final.predict(X)
data_filtered['Residuals'] = data_filtered['Mortality_log'] - data_filtered['Predicted_Mortality_log']
countries = ['Canada','Indonesia','India','Sudan']
plt.figure()
plt.hist(data=data_filtered, x='Residuals', bins=20, color='teal')
plt.xlabel('Residuals', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Residuals Overall', fontsize=12, weight='bold')
plt.xticks(np.arange(-3, 3, 0.5))
plt.savefig('Residuals_Overall.jpg')
for i in countries:
    plt.figure()
    df = data_filtered[data_filtered['Country'] == i]
    plt.hist(data=df, x='Residuals', color='teal')
    plt.xlabel('Residuals', fontsize=12)
    plt.ylabel('Frequency', fontsize=12)
    plt.title('Residuals ' + i, fontsize=12, weight='bold')
    plt.xticks(np.arange(-3, 3, 0.5))
    plt.savefig('Residuals_' + i + '.jpg')
In addition to plotting the residuals, we can run some tests to check whether our regression has been successful.
1) First, we will test for normality of the residuals using the Shapiro Wilk test as shown below.
shapiro_wilk_results = {}
shapiro_wilk_results['Overall'] = shapiro(data_filtered['Residuals'])
# Countries under consideration
countries = ['Canada','Indonesia','India','Sudan']
for country in countries:
    residual = data_filtered[data_filtered['Country']==country].loc[:, "Residuals"]
    shapiro_wilk_test = shapiro(residual)
    shapiro_wilk_results[country] = shapiro_wilk_test
shapiro_wilk_results
We can observe that the residuals are not normal for the overall data including all countries. But when we consider our 4 countries, the test says the residuals are normal for all 4 countries by failing to reject the Null hypothesis with p>0.05
2) Next, we will check for Heteroskedasticity in our residuals. We begin by first plotting our residual values with respect to time and also with respect to the predicted/estimated values of our dependent variable.
plt.figure()
sns.scatterplot(x='Year',y='Residuals',data = data_filtered);
plt.savefig('Scatterplot_Residuals_Year.jpg');
plt.figure()
sns.scatterplot(x='Predicted_Mortality_log',y='Residuals',data = data_filtered);
plt.savefig('Scatterplot_Residuals_Predicted.jpg');
From the plots above, it appears our residuals do not suffer from severe Heteroskedasticity: the spread of the residuals does not change systematically with the predicted value of the dependent variable, although it is not perfectly constant either.
To confirm if our data has Heteroskedasticity, we can use a hypothesis test. The test we have chosen is the Breusch Pagan test for heteroskedasticity and is shown below.
H0: The error(residual) variances are all equal
HA: The error(residual) variances are not equal. More specifically, as X increases, the variances increase (or decrease)
names = ['Lagrange Multiplier statistic:', 'LM test\'s p-value:', 'F-statistic:', 'F-test\'s p-value:']
test = sms.het_breuschpagan(final.resid, final.model.exog)
lzip(names, test)
According to the results of this test, we can say that our residuals suffer from Heteroskedasticity because p<0.05 which means we reject the Null hypothesis. However, the F-statistic and the Lagrange multiplier statistic values are considerably low. This means, even though the test has statistical significance, the practical impact is low.
Based on the regression model’s results we can say, the social and economic groups can be arranged as follows based on their impact on Child Mortality
Household Statistics > Economy > Education > Population
Household statistics amongst other social and economic factors have a very high correlation with Child Mortality. This means, Child Mortality has a strong relationship with both the Fertility rate in women as well as the Age Dependency Ratio. As the Fertility rate and Age Dependency Ratio increased, so did Child Mortality. Under Economy, the only factor that has a strong relationship with Child Mortality was GDP and the relationship was negative. This meant that as GDP increased, Child Mortality decreased. Other economic factors (Inflation, Net ODA received and Unemployment) did not have a strong relationship with Child Mortality. This might be due to outliers and high degree of variance in these indices which did not correlate to a similar degree of change in Child Mortality. Education Expenditure had a weak but a considerable negative relationship with Child Mortality. Population Growth did not have a considerable relationship with Child Mortality and neither was the regression coefficient significant.
Note: Fertility rate in women and Age Dependency ratio had a high correlation amongst themselves and so the model considered a single variable to represent both indices.
Based on the value of residuals for each country, we can see that a) The regression algorithm predicted a lower value for Child Mortality than the actual value for India and Indonesia b) For Sudan, the predicted value was higher than what was actually observed. c) For Canada, the prediction done by the multi linear regression was in line with the actual values that were observed
for i in ['Mortality_log', 'Population_Growth', 'Fertility_Age_Dependency_log', 'GDP_log', 'Inflation', 'Net_ODA', 'Unemployment',
          'Education_Expenditure']:
    g = sns.FacetGrid(data_filtered[data_filtered['Country'].isin(countries)], col="Country")
    g.map(sns.scatterplot, i, "Residuals", alpha=.7)
    plt.savefig('Residuals_vs_' + i + '.jpg')
For instance, the plots above reveal that changes in Population Growth have little impact on the residuals for all countries except India. This means the regression coefficient for Population Growth is not able to explain the rise in Child Mortality with respect to the rise in Population Growth in India.